We have about a year of order data from our store. We want to carry out a product range analysis as well as some exploratory data analysis to learn more about our users' behaviour.
Goals of the Project
These are the questions that we need to answer:
# updating libraries
!pip install matplotlib --user --quiet
# plotly.express ships with plotly (>= 4.0), so no separate install is needed
!pip install plotly --user --quiet
# importing libraries
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import re
from itertools import combinations
from collections import Counter
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
from scipy import stats as st
# reading the data
try:
    orders_orig = pd.read_csv(r"C:\Users\laura.struempler\Downloads\ecommerce_dataset_us\ecommerce_dataset_us.csv", sep='\t')
except FileNotFoundError:
    orders_orig = pd.read_csv(r"/datasets/ecommerce_dataset_us.csv", sep='\t')
orders = orders_orig.copy()
# getting some general information on size, datatypes and values
print(orders.info())
display(orders.head())
print()
display(orders.describe(include=["int", "float"]))
print()
display(orders.describe(include="object"))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   InvoiceNo    541909 non-null  object
 1   StockCode    541909 non-null  object
 2   Description  540455 non-null  object
 3   Quantity     541909 non-null  int64
 4   InvoiceDate  541909 non-null  object
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 28.9+ MB
None
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID |
|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 11/29/2018 08:26 | 2.55 | 17850.0 |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 11/29/2018 08:26 | 3.39 | 17850.0 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 11/29/2018 08:26 | 2.75 | 17850.0 |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 11/29/2018 08:26 | 3.39 | 17850.0 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 11/29/2018 08:26 | 3.39 | 17850.0 |
| | UnitPrice | CustomerID |
|---|---|---|
| count | 541909.000000 | 406829.000000 |
| mean | 4.611114 | 15287.690570 |
| std | 96.759853 | 1713.600303 |
| min | -11062.060000 | 12346.000000 |
| 25% | 1.250000 | 13953.000000 |
| 50% | 2.080000 | 15152.000000 |
| 75% | 4.130000 | 16791.000000 |
| max | 38970.000000 | 18287.000000 |
| | InvoiceNo | StockCode | Description | InvoiceDate |
|---|---|---|---|---|
| count | 541909 | 541909 | 540455 | 541909 |
| unique | 25900 | 4070 | 4223 | 23260 |
| top | 573585 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 10/29/2019 14:41 |
| freq | 1114 | 2313 | 2369 | 1114 |
We have 541,909 entries in our dataset, across 7 columns: the invoice number, the stock code of the item and its description, the number of items ordered, the timestamp of the order, the unit price and the customer id.
We can already see that we will need to change some datatypes (the timestamp to datetime, for example) and convert the descriptions to lowercase so that everything is written the same way.
The mean price of our items is 4.61, but since we also have negative values (probably from returned items) this figure isn't reliable.
We also seem to have some missing descriptions and orders without customer ids; we will tackle these problems in the next steps.
The most popular item is "WHITE HANGING HEART T-LIGHT HOLDER", it was listed 2369 times.
# renaming the columns to more convenient ones
orders.columns = ['order_id', 'code', 'desc', 'qty', 'timestamp', 'unit_price', 'customer_id']
# changing timestamp to datetime object (the data uses the MM/DD/YYYY HH:MM format)
orders['timestamp'] = pd.to_datetime(orders['timestamp'], format='%m/%d/%Y %H:%M')
# changing the descriptions to all lowercase letters
orders['desc'] = orders['desc'].str.lower()
orders['code'] = orders['code'].str.lower()
We changed the column names, converted the timestamp into a datetime object and made sure that all strings are written in only lowercase letters.
Let's find out in which columns we have missing values and try to fill them, or decide to drop them.
# finding the count and share of missing values in each column
missing = orders.isnull().sum().reset_index()
share = (orders.isnull().sum() / len(orders) * 100).round(2).reset_index()
# merging the information into one table
missing_values = missing.merge(share, on='index')
missing_values.columns = ['column', 'missing total', 'missing share']
# displaying the number of missing values
display(missing_values)
| | column | missing total | missing share |
|---|---|---|---|
| 0 | order_id | 0 | 0.00 |
| 1 | code | 0 | 0.00 |
| 2 | desc | 1454 | 0.27 |
| 3 | qty | 0 | 0.00 |
| 4 | timestamp | 0 | 0.00 |
| 5 | unit_price | 0 | 0.00 |
| 6 | customer_id | 135080 | 24.93 |
There are 1,454 missing values in the description column, which equals 0.27% of all rows. We will try to fill them, but since the number is so low it probably wouldn't affect the outcome of our analysis much if we dropped them.
For customer_id, on the other hand, there are far more missing values: in 135,080 rows, or 24.9% of our entries, we don't have a customer id. There could be several reasons for this: customers might be able to place an order without registering on our site, or the "orders" are not real orders but internal positions like fees or inventory adjustments.
# checking if there are other values missing as well where the description is missing
missing_desc = orders[orders['desc'].isnull()]
missing_desc.isnull().sum()
order_id          0
code              0
desc           1454
qty               0
timestamp         0
unit_price        0
customer_id    1454
dtype: int64
# look at a few examples where the description is missing
missing_desc.sample(10, random_state=0)
| | order_id | code | desc | qty | timestamp | unit_price | customer_id |
|---|---|---|---|---|---|---|---|
| 143343 | 548694 | 22720 | NaN | 10 | 2019-03-30 17:28:00 | 0.0 | NaN |
| 538554 | 581408 | 85175 | NaN | 20 | 2019-12-06 14:06:00 | 0.0 | NaN |
| 382679 | 569934 | 85049d | NaN | 1 | 2019-10-04 17:51:00 | 0.0 | NaN |
| 143303 | 548677 | 72781 | NaN | -30 | 2019-03-30 16:40:00 | 0.0 | NaN |
| 299513 | 563101 | 22084 | NaN | 1 | 2019-08-10 10:07:00 | 0.0 | NaN |
| 173268 | 551669 | 21355 | NaN | 90 | 2019-05-01 12:38:00 | 0.0 | NaN |
| 279310 | 561271 | 10080 | NaN | 170 | 2019-07-24 12:18:00 | 0.0 | NaN |
| 440210 | 574518 | 21174 | NaN | 63 | 2019-11-02 13:42:00 | 0.0 | NaN |
| 171492 | 551422 | 22084 | NaN | -150 | 2019-04-26 14:11:00 | 0.0 | NaN |
| 332408 | 566068 | 22268 | NaN | -86 | 2019-09-06 16:37:00 | 0.0 | NaN |
# checking if there are other items with the same order_id
len(orders[orders['order_id'].isin(missing_desc['order_id'])])
1454
# removing the entries
orders = orders[~orders['order_id'].isin(missing_desc['order_id'])]
We found 1,454 missing descriptions. For all of these the customer_id is missing as well, and there are no other items with the same order_id, so the information can't be restored. We could fill the description based on the stock code, but we still wouldn't know which customer the order belongs to, so we dropped those rows entirely.
# finding the orders that don't have customer ids
missing_cust = orders[orders['customer_id'].isnull()]
len(missing_cust)
133626
# checking if there are other items with the same order id that might have a customer id
len(orders[orders['order_id'].isin(missing_cust['order_id'])])
133626
Unfortunately we couldn't find any other positions with the same order ids, so there is no way to logically fill the missing values. Since the number of rows is very high, and since there could be valid reasons for the gaps, we fill them with "Unknown" rather than dropping them. This way we can still take them into account.
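Filling those ids with a placeholder is a one-liner; here is a minimal sketch on a toy frame (the values are made up, only the column names match our renamed dataset):

```python
import numpy as np
import pandas as pd

# toy stand-in for `orders` with two rows lacking a customer id (made-up values)
orders = pd.DataFrame({
    'order_id': ['536365', '536366', '536367'],
    'customer_id': [17850.0, np.nan, np.nan],
})

# keep the real ids, label the rest so the rows stay usable for per-order metrics
orders['customer_id'] = orders['customer_id'].fillna('Unknown')
```

Note that the column then mixes floats and strings, which is fine for grouping but rules out numeric operations on it.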
# checking some duplicates to see what their nature is
orders[orders.duplicated(keep=False)].sort_values(['customer_id', 'desc']).head(12)
| | order_id | code | desc | qty | timestamp | unit_price | customer_id |
|---|---|---|---|---|---|---|---|
| 395371 | 571034 | 23245 | set of 3 regency cake tins | 4 | 2019-10-11 12:47:00 | 4.95 | 12359.0 |
| 395455 | 571034 | 23245 | set of 3 regency cake tins | 4 | 2019-10-11 12:47:00 | 4.95 | 12359.0 |
| 395388 | 571034 | 23239 | set of 4 knick knack tins poppies | 6 | 2019-10-11 12:47:00 | 4.15 | 12359.0 |
| 395410 | 571034 | 23239 | set of 4 knick knack tins poppies | 6 | 2019-10-11 12:47:00 | 4.15 | 12359.0 |
| 395442 | 571034 | 23494 | vintage doily deluxe sewing kit | 3 | 2019-10-11 12:47:00 | 5.95 | 12359.0 |
| 395443 | 571034 | 23494 | vintage doily deluxe sewing kit | 3 | 2019-10-11 12:47:00 | 5.95 | 12359.0 |
| 29743 | 538826 | 22749 | feltcraft princess charlotte doll | 1 | 2018-12-12 12:58:00 | 3.75 | 12370.0 |
| 29744 | 538826 | 22749 | feltcraft princess charlotte doll | 1 | 2018-12-12 12:58:00 | 3.75 | 12370.0 |
| 479879 | 577228 | 22144 | christmas craft little friends | 1 | 2019-11-16 12:07:00 | 2.10 | 12391.0 |
| 479887 | 577228 | 22144 | christmas craft little friends | 1 | 2019-11-16 12:07:00 | 2.10 | 12391.0 |
| 479895 | 577228 | 22270 | happy easter hanging decoration | 1 | 2019-11-16 12:07:00 | 3.75 | 12391.0 |
| 479898 | 577228 | 22270 | happy easter hanging decoration | 1 | 2019-11-16 12:07:00 | 3.75 | 12391.0 |
# finding number and share of duplicates
print('Duplicates:', orders.duplicated().sum())
print('Share in %: {:.2%}'.format(orders.duplicated().sum()/len(orders)))
Duplicates: 5268
Share in %: 0.97%
# dropping the duplicates and checking how many orders are left
orders = orders.drop_duplicates(keep='last').copy()
print('Remaining rows:', len(orders))
Remaining rows: 535187
There is no obvious explanation for the duplicates: if the items belonged to the same order, they should have been summed into one row. Since they make up less than 1% of our data we dropped them altogether.
# adding date and month columns
orders['date'] = orders['timestamp'].dt.normalize()
orders['month'] = orders['timestamp'].dt.to_period('M').dt.to_timestamp()
# calculating the total price for the items
orders['total_price'] = orders['qty'] * orders['unit_price']
We added columns for date and month and calculated the total price for each row based on quantity and unit price.
When we looked at the general information on our data we already saw a few very high values. Even though some of our customers may be retailers, there could also be typing mistakes in manually entered quantities or prices. Let's take a look.
# finding the 99.99th percentile thresholds
up_percentile = np.percentile(orders['unit_price'], 99.99)
qty_percentile = np.percentile(orders['qty'], 99.99)
# plotting boxplots for unit price and quantity with and without outliers
plt.figure(figsize=(15,9))
plt.suptitle('Boxplots with and without extreme outliers', fontsize=22)
plt.subplot(2,2,1)
plt.title('Distribution of Quantity (all data)')
plt.boxplot(orders['qty'])
plt.subplot(2,2,2)
plt.title('Distribution of Quantity (99.99 percentile)')
plt.boxplot(orders[abs(orders['qty']) < qty_percentile]['qty'])
plt.subplot(2,2,3)
plt.title('Distribution of unit price (all data)')
plt.boxplot(orders['unit_price'])
plt.subplot(2,2,4)
plt.title('Distribution of unit price (99.99 percentile)')
plt.boxplot(orders[abs(orders['unit_price']) < up_percentile]['unit_price'])
plt.show()
There are two extreme outliers in quantity at around 80,000. When we only use the entries within the 99.99th percentile range, the minimum and maximum values are about 1,500 (positive and negative). There are still a lot of values outside the box, but there are no wide gaps.
For the unit prices we had one very low negative price and one outlier of around 40,000. Without the upper percentiles there are almost no negative values and the maximum is around 1,500. Again there are no wide gaps, but there are still many outliers beyond the most common price range. We will keep those, since they were actual orders and need to be taken into account.
Before we remove the outliers let's check the actual values.
# printing the maximum values and the 99.99th percentile thresholds
print('\033[1m' + '99.99 percentiles' + '\033[0m')
print('Quantity max:', orders['qty'].max())
print('Quantity percentile:', round(qty_percentile, 2))
print()
print('Unit price max:', orders['unit_price'].max())
print('Unit price percentile:', round(up_percentile, 1))
99.99 percentiles
Quantity max: 80995
Quantity percentile: 1440.0
Unit price max: 38970.0
Unit price percentile: 2234.6
Even though we chose a very conservative threshold that keeps 99.99% of our data, there are outliers with much higher values. For quantity the maximum is 80,995, while only 0.01% of the entries exceed 1,440.
For the unit price the outcome is similar: the highest unit price is 38,970, while only 0.01% of unit prices exceed 2,234.6.
Now let's remove those outliers.
# removing the outliers from the actual dataframe
print(len(orders))
orders = orders[abs(orders['qty']) < qty_percentile]
print(len(orders))
orders = orders[abs(orders['unit_price']) < up_percentile]
print(len(orders))
535187
535096
535096
We are left with 535,096 entries. Dropping the rows with extreme unit prices barely made a difference; they largely overlap with the rows with abnormally high quantities. This supports the idea that those entries were mistakes and that we were right to remove them.
We will remove postage, dotcom postage, manual, samples, amazon fee, adjust bad debt and cruk commission entries from our data. These appear to be internal positions rather than products. Customers do pay postage fees, but postage is not a product that they order, so these entries are not relevant to the questions we want to answer.
# listing descriptions (or parts of descriptions) that we want to remove
inventory_desc = [' aside', '20713', '\\?', '^carriage', 'adjust', 'alan', 'amazon', 'away', 'breakages',
'broken', 'c2', 'cargo', 'charges', 'code', 'corr', 'counted', 'cracked', 'credit',
'cruk commission', 'crushed', 'damage', 'dotcom', 'ebay', 'error', 'faulty',
'fba', 'find', 'found', 'import', 'lewis', 'lost', 'mailout', 'manual', 'marked', 'mia',
'missing', 'mix up', 'mixed up', 'next day', 'oops', 'order', 'packing', 'postage',
'rcvd', 'returned', 'sample', 'showroom', 'smashed', 'sold ', 'stock check', 'test',
'to push', 'unsale', 'website', 'wet', 'wrong', 'dagamed', 'mouldy', 'display']
# removing entries whose description matches any of the patterns above
pattern = '|'.join(inventory_desc)
orders = orders[~orders['desc'].str.contains(pattern)]
orders = orders[orders['desc'] != 'check']
# checking the number of entries that are left
print(len(orders))
531249
Removing these entries cost us a few thousand more rows, but we still have 531,249 entries to work with, and we no longer have to worry about manual adjustments and mistakes. The only non-product entries left should be discounts and returns, which we will keep for the calculation of revenues and check sums.
We might have some returned or canceled orders for which we don't have the information on the original order. In that case the first order_id for a customer would be a cancellation with a negative quantity. We will remove those from our dataset.
# removing commas from the descriptions
orders['desc'] = orders['desc'].str.replace(',', '')
# sorting orders by timestamp and finding the first order_ids
# for each customer if they have negative quantities
earlier_returns = (orders.sort_values('timestamp')
.groupby(['customer_id', 'desc'], as_index=False)
.agg({'order_id':'first', 'qty':'first'})
.query('qty <= 0'))
# removing those orders from the table
orders = orders[~orders['order_id'].isin(earlier_returns['order_id'])]
len(orders)
529277
We removed about two thousand entries but are still left with 529,277. Now we can be sure that the first entry for each customer is an actual order and not a return.
Since we can't tell directly which return belongs to which order, let's see if we can find the order that matches each return. This won't be 100% accurate, since a customer may have ordered the same item in more than one order.
# clearing the order_id on returns that do have a customer_id, so it can be refilled from the preceding order
orders.loc[(orders['customer_id'].notnull()) & (orders['qty'] < 0), 'order_id'] = np.nan
# sorting by timestamp, customer_id and description
orders = orders.sort_values(['timestamp', 'customer_id', 'desc'])
# filling missing customer_ids with "Unknown"
orders['customer_id'] = orders['customer_id'].fillna('Unknown')
# filling the return order_ids with the last order_id from the same customer
orders['order_id'] = (orders.sort_values(['timestamp', 'customer_id','desc'])
.groupby('customer_id')['order_id']
.transform(lambda x: x.ffill()))
Let's check if this worked or if there are returns left.
# finding the orders that have no order_id
print(orders['order_id'].isnull().sum())
orders[orders['order_id'].isnull()]
10
| | order_id | code | desc | qty | timestamp | unit_price | customer_id | date | month | total_price |
|---|---|---|---|---|---|---|---|---|---|---|
| 21698 | NaN | 85048 | 15cm christmas glass ball 20 lights | -1 | 2018-12-07 15:23:00 | 7.95 | 16042.0 | 2018-12-07 | 2018-12-01 | -7.95 |
| 21697 | NaN | 22780 | light garland butterfiles pink | -4 | 2018-12-07 15:23:00 | 4.25 | 16042.0 | 2018-12-07 | 2018-12-01 | -17.00 |
| 28562 | NaN | 22636 | childs breakfast set circus parade | -1 | 2018-12-11 14:38:00 | 8.50 | 12476.0 | 2018-12-11 | 2018-12-01 | -8.50 |
| 28563 | NaN | 21658 | glass beurre dish | -1 | 2018-12-11 14:38:00 | 3.95 | 12476.0 | 2018-12-11 | 2018-12-01 | -3.95 |
| 31964 | NaN | 22461 | savoy art deco clock | -1 | 2018-12-13 14:33:00 | 12.75 | 15535.0 | 2018-12-13 | 2018-12-01 | -12.75 |
| 32373 | NaN | 21906 | pharmacie first aid tin | -11 | 2018-12-13 16:40:00 | 6.75 | 18230.0 | 2018-12-13 | 2018-12-01 | -74.25 |
| 32660 | NaN | 22766 | photo frame cornice | -2 | 2018-12-14 10:30:00 | 2.95 | 16996.0 | 2018-12-14 | 2018-12-01 | -5.90 |
| 87603 | NaN | 22423 | regency cakestand 3 tier | -5 | 2019-02-09 10:49:00 | 10.95 | 14441.0 | 2019-02-09 | 2019-02-01 | -54.75 |
| 120761 | NaN | 22423 | regency cakestand 3 tier | -1 | 2019-03-13 15:42:00 | 10.95 | 15215.0 | 2019-03-13 | 2019-03-01 | -10.95 |
| 526105 | NaN | 22666 | recipe box pantry yellow design | -1 | 2019-12-03 13:20:00 | 2.95 | 14777.0 | 2019-12-03 | 2019-12-01 | -2.95 |
These are probably returns from users whose first two or three order_ids were return orders; we only removed the first one with a negative quantity. We will drop these as well.
# dropping the orders without id
orders = orders[orders['order_id'].notnull()]
len(orders)
529267
Now we want to eliminate items that were returned. To do this we group identical items within the same order and sum their quantities: if 10 items were purchased and 5 returned, the new dataframe will have one row with 5 items for that order.
# grouping and aggregating the dataframe
orders = (orders.groupby(['order_id',
'code',
'desc',
'unit_price',
'customer_id'], as_index=False)
.agg({'timestamp':'first',
'date':'first',
'month':'first',
'qty':'sum',
'total_price':'sum'}))
# rearranging the columns
orders = orders[['order_id', 'customer_id', 'code', 'desc', 'qty', 'unit_price', 'total_price', 'timestamp', 'date', 'month']]
# removing rows where the net quantity is now zero or negative, unless the item is a discount
orders = orders[(orders['desc'] == "discount") | ((orders['qty'] > 0) & (orders['desc'] != "discount"))]
len(orders)
515987
We removed a little over 13,000 entries this way. At this point our dataframe should only contain items that were purchased and not returned or cancelled, plus discounts (which we can't assign to a specific item, only to the total check of an order).
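A quick sanity check confirms the invariant we expect at this point: after aggregation, every non-discount row has a positive net quantity. A minimal sketch on a toy frame (the rows are made up):

```python
import pandas as pd

# toy stand-in for the aggregated `orders` table (made-up values)
orders = pd.DataFrame({
    'desc': ['white metal lantern', 'discount', 'glass star frosted t-light holder'],
    'qty': [5, -1, 3],
})

# rows that would violate the invariant: non-discount items with qty <= 0
leftover_returns = orders[(orders['desc'] != 'discount') & (orders['qty'] <= 0)]
print(len(leftover_returns))
```

On the real table this count should be zero after the filtering step above.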
We need to sort our unique items into a few categories so that we can analyse which categories are more popular, how much revenue they generate, etc. First we should make sure that typos and divergent descriptions are cleaned up. To do this we will keep only one description per stock code and use it for every item with that code.
# choosing one description for the code (the first that we find)
desc_clean = orders.groupby('code', as_index=False).agg({'desc':'first'})
# naming the columns and merging the cleaned description to the orders table
desc_clean.columns = ['code', 'desc_clean']
orders = orders.merge(desc_clean, how='left', on='code')
print('Unique Descriptions:', len(orders['desc'].unique()))
print('Unique Descriptions after cleaning:', len(orders['desc_clean'].unique()))
Unique Descriptions: 3983
Unique Descriptions after cleaning: 3764
There were around 200 cases where items had the same stock code but a different description. After cleaning the descriptions we know that we have 3,764 unique products.
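To inspect which stock codes carried conflicting descriptions before the cleaning, one could group by code and count distinct descriptions. A sketch on made-up rows (one code deliberately has a typo variant):

```python
import pandas as pd

# toy stand-in for `orders` with one code that has two description variants (made up)
orders = pd.DataFrame({
    'code': ['85123a', '85123a', '71053'],
    'desc': ['white hanging heart t-light holder',
             'white hanging heart tlight holder',   # typo variant of the same item
             'white metal lantern'],
})

# codes whose rows carry more than one distinct description
desc_counts = orders.groupby('code')['desc'].nunique()
conflict_codes = desc_counts[desc_counts > 1].index.tolist()
print(conflict_codes)
```

This is how the "around 200" figure could be verified before picking one canonical description per code.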
Now let's define our categories and add some buzzwords to them. To find these I looked at almost all descriptions and decided on a category by hand; in real life this information would usually be available. Since the descriptions aren't always clear about the type of item, lemmatization or TextBlob wouldn't have spared me the work, so I didn't use them in this notebook.
# defining categories and buzzwords
household = [' mat', ' pegs', ' tray', 'ashtray', 'bath', 'brush', 'cake towel', 'chair', 'clean',
'coaster', 'curtain', 'fire bucket', 'first aid', 'hand towel', 'home', 'hot water',
'ironing', 'matches', 'napkin', 'organiser', 'pill box', 'place', 'plaster', 'pot holder',
'repair', 'shelf', 'shoe', 'shower', 'silicon', 'snack', 'soap', 'spirit level', 'swat',
'table run', 'tidy', 'tile', 'tissue', 'tool set', 'washing', 'waste']
candles_lamps = [' light', 'candle', 'chandelier', 'incense', 'lamp', 'lantern', 'light chain',
'lightbulb', 'night light', 'nightlight', 'oil burner', 'shade', 't-light',
'tea light', 'ubo', 'votive']
accessoires = [' clips', ' hair', '^key', 'bangle', 'brac', 'brooch', 'choker', 'earr', 'gloss',
'hairclip', 'key chain', 'key fob', 'key-chain', 'lipstick', 'neckl', 'phone char',
'ring', 'scarf', 'sunglasses', 'backpack', 'bag', 'purse', 'sack', 'shopper', 'tote']
living = [' flag', ' pony', ' rack', ' sign', ' stool', ' tin', 'art flower', 'artif', 'artii',
'assorted flower', 'basin', 'bird house', 'black board', 'blackboard', 'block', 'bobbles',
'bottle', 'boxes', 'buddha', 'bunting', 'c/cover', 'cabinet', 'calendar', 'camphor',
'cannister vintage', 'canvas', 'ceramic', 'chalkboard', 'chest', 'chime', 'clock', 'cloth',
'coat rack', 'crackers', 'crates', 'cushion', 'deco', 'doormat', 'doorsign', 'drawer',
'e chain', 'enamel pot', 'feather tree', 'felt', 'feltcraft', 'filler', 'flag ', 'flag of',
'flying', 'folkart', 'fragrance', 'frame', 'french style basket', 'garland', 'geisha',
'glass jar', 'hanger', 'hanging', 'hen house', 'honeycomb', 'hook', 'hot water', 'image',
'jewel', 'keepsake', 'knob', 'letter rack', 'liners', 'magnet', 'memo board', 'memoboard',
'metal box', 'metal cat', 'mirror', 'mobile', 'money', 'news', 'nursery', 'ornament',
'oval box', 'painted', 'photo', 'piggy', 'plaque', 'porcelain', 'pouf', 'print', 'quilt',
'rosette', 'round box', 'round container', 'scented fl', 'scottie dog', 'screen', 'seat',
'stop', 'storage', 'storage box', 'storage jar', 'table', 'throw', 'toadstool', 'treasure tin',
'trinket', 'vase', 'wall', 'wicker', 'wind', 'wood letters', 'wreath', 'push down', 'star', 'heart']
kitchen = [' glass', ' jar', ' pan', ' wine', 'apron', 'baking', 'beaker', 'bell jar', 'biscuit', 'bowl',
'bread bin', 'breakfast', 'caddy', 'cake ', 'cases', 'chalice', 'chopping', 'chopstick', 'cloche',
'cocktail', 'coffee', 'colander', 'cooking', 'cosy', 'cup', 'cutlery', 'cutter', 'dish', 'doil',
'egg ', 'food cover', 'frying', 'jam ', 'jampot', 'jug', 'kitchen scale', 'ladle', 'lolly', 'lunch',
'mould', 'mug', 'oven glove', 'oven mitt', 'pantry', 'plate', 'popcorn', 'recipe', 'saucer',
'spoon', 'squeezer', 'strainer', 'straws', 'sugar', 'tea', 'tea towel', 'tier', 'toast', 'towel',
'tumbler', 'utensil', 'wine bottle']
garden_travel = ['dovecote', 'feeder', 'feeding', 'garden ', 'gardener', 'gerden', 'gnome', 'grow', 'hammock',
'hamper', 'marker', 'parasol', 'plant', 'sow', 'spade', 'twine', 'watering', 'wheelbarrow']
diy_stationary = [' fan', 'badge', 'balloon', 'bicycle', 'book', 'button box', 'card', 'clam', 'clay',
'confetti', 'crayon', 'envelope', 'eraser', 'gift tag', 'hand warmer', 'invite', 'journal',
'knitting', 'knitting kit', 'lariat', 'luggage tag', 'note', 'own eggs', 'paint set',
'paper chain', 'paper craft', 'papercraft', 'party bag', 'passport', 'patches', 'pen',
'punch', 'rain ', 'ribbon', 'ruler', 'scissor', 'sew ', 'sewing', 'stamp', 'stationery',
'stencil', 'sticker', 'tape', 'tinsel', 'torch', 'travel', 'umbrella', 'weight', 'wrap',
'writing', 'party charms']
holidays = ['advent', 'birthday', 'bunnies', 'bunny', 'chicks in basket', 'christmas', 'cones', 'easter',
'fluffy', 'knitted hen', 'paper egg', 'santa', 'snowflake']
toys = ['baby gift', 'baby mouse', 'bingo', 'bubbles', 'chalk sticks', 'cluster slide', 'creepy', 'crosses',
'dinosaur', 'doll', 'domino', 'farm', 'game', 'glider', 'glow', 'harmonica', 'helicopter', 'inflat',
'jigsaw', 'ladders', 'ludo', 'magic', 'monster', 'ninja', 'play', 'pop box', 'puppet', 'rocking',
'skip', 'skittles', 'snake', 'space', 'spinning', 'tails', 'tattoos', 'teddy', 'toy']
other = [' bib', 'battery', 'boombox', 'calc', 'candy spot', 'cat food', 'charger', 'collar', 'dog bowl',
'dog can', 'dog food', 'doggy', 'electronic', 'fairy pole', 'flannel', 'grass skirt', 'headphone',
'radio', 'slipper', 'sombrero', 'speaker', 'stress', 'white base']
# creating a list of the descriptions and adding an empty category column
descriptions = orders.groupby('desc_clean', as_index=False)['qty'].sum()
descriptions['category'] = ' '
# defining the lists that hold our buzzwords and the names of the categories
search = [toys, kitchen, accessoires, household, living, garden_travel, candles_lamps, diy_stationary, other, holidays]
replace = ['Toys', 'Kitchen', 'Fashion & Accessoires', 'Household', 'Living',
'Travel & Garden', 'Candles & Lamps', 'DIY & Stationary', 'Other', 'Holidays']
# searching the descriptions for buzzwords and adding the category
for buzzwords, category in zip(search, replace):
    for word in buzzwords:
        descriptions.loc[descriptions['desc_clean'].str.contains(word), 'category'] = category
# labelling the remaining uncategorised rows (only discounts are left) as "discount"
descriptions.loc[descriptions['category'] == ' ', 'category'] = "discount"
# finding the number of items from each category that were sold
descriptions.groupby('category', as_index=False)['qty'].sum().sort_values('qty', ascending = False)
| | category | qty |
|---|---|---|
| 6 | Living | 1446077 |
| 1 | DIY & Stationary | 893576 |
| 5 | Kitchen | 842324 |
| 2 | Fashion & Accessoires | 663417 |
| 3 | Holidays | 445792 |
| 0 | Candles & Lamps | 442731 |
| 4 | Household | 206078 |
| 8 | Toys | 185437 |
| 9 | Travel & Garden | 102743 |
| 7 | Other | 26525 |
| 10 | discount | -770 |
# adding the category to each ordered product
orders = orders.merge(descriptions[['desc_clean', 'category']], how='left', on='desc_clean')
# let's check if it worked
orders.head()
| | order_id | customer_id | code | desc | qty | unit_price | total_price | timestamp | date | month | desc_clean | category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 17850.0 | 21730 | glass star frosted t-light holder | 6 | 4.25 | 25.50 | 2018-11-29 08:26:00 | 2018-11-29 | 2018-11-01 | glass star frosted t-light holder | Candles & Lamps |
| 1 | 536365 | 17850.0 | 22752 | set 7 babushka nesting boxes | 2 | 7.65 | 15.30 | 2018-11-29 08:26:00 | 2018-11-29 | 2018-11-01 | set 7 babushka nesting boxes | Living |
| 2 | 536365 | 17850.0 | 71053 | white metal lantern | 6 | 3.39 | 20.34 | 2018-11-29 08:26:00 | 2018-11-29 | 2018-11-01 | white metal lantern | Candles & Lamps |
| 3 | 536365 | 17850.0 | 84029e | red woolly hottie white heart. | 6 | 3.39 | 20.34 | 2018-11-29 08:26:00 | 2018-11-29 | 2018-11-01 | red woolly hottie white heart. | Living |
| 4 | 536365 | 17850.0 | 84029g | knitted union flag hot water bottle | 6 | 3.39 | 20.34 | 2018-11-29 08:26:00 | 2018-11-29 | 2018-11-01 | knitted union flag hot water bottle | Living |
We successfully sorted all of our products into a few categories. Most items sold are from the categories Living, DIY & Stationary and Kitchen, but we will go into this in more depth in the upcoming analysis. Now that we're done with categorising we are ready to explore our customers' behaviour.
We will check some general metrics to get to know the data and our customers.
We have a lot of orders without customer_ids. We won't use those when counting customers or computing the average number of orders per customer, since they would distort these values. But we can use them for the average check, since those entries still have an order_id.
# finding the number of orders for each customer
orders_per_customer = (orders.query('customer_id != "Unknown"')
.groupby('customer_id', as_index=False)
.agg({'order_id':'nunique'}))
# plotting the number of orders per customer
fig = px.histogram(orders_per_customer, x="order_id",
title='Histogram of Number of Orders',
labels={'order_id':'Number of Orders'}, # can specify one label per df column
opacity=0.8,
log_y=True, # represent bars with log scale
color_discrete_sequence=['indianred'] # color of histogram bars
)
fig.update_layout(font_family="Bahnschrift")
fig.show()
# finding the average number of orders per customer
print('Average number of orders:', round(orders_per_customer['order_id'].mean(), 2))
Average number of orders: 4.23
The majority of our customers ordered only a few times in the given period: 1,503 ordered just once, and from there the count of customers declines steadily. Very few customers ordered more than 40 times; those are probably retailers that frequently restock their stores.
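The one-time-buyer count quoted above can be read straight off `orders_per_customer`; a sketch on made-up counts:

```python
import pandas as pd

# toy stand-in for `orders_per_customer` (unique orders per customer, made-up values)
orders_per_customer = pd.DataFrame({
    'customer_id': [12346.0, 12347.0, 12348.0, 12349.0],
    'order_id': [1, 1, 5, 44],
})

# customers with exactly one order
one_time_buyers = (orders_per_customer['order_id'] == 1).sum()
print(one_time_buyers)
```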
# finding the total value of each order
checks = (orders.query('customer_id != "Unknown"')
.groupby('order_id', as_index=False)
.agg({'total_price':'sum'}).query('total_price > 0'))
fig = px.histogram(checks, x="total_price",
title='Histogram of average check',
labels={'total_price':'check value'}, # can specify one label per df column
opacity=0.8,
log_y=True, # represent bars with log scale
color_discrete_sequence=['indianred'] # color of histogram bars
)
fig.update_layout(font_family="Bahnschrift")
fig.show()
print('Mean check:', round(checks['total_price'].mean(), 2))
print('Median check:', round(checks['total_price'].median(), 2))
Mean check: 453.93
Median check: 299.22
Most of our customers spend up to about 500 on one order. There are a few very costly orders, some above 20,000, but it is very unusual to find an order with a total price of more than 2,000.
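The claim that checks above 2,000 are unusual can be quantified as a share of orders; a sketch on made-up check values:

```python
import pandas as pd

# toy stand-in for the `checks` table (made-up order totals)
checks = pd.DataFrame({'total_price': [50.0, 120.0, 299.0, 450.0, 800.0, 2500.0, 21000.0]})

# share of orders whose total exceeds 2,000
share_over_2000 = (checks['total_price'] > 2000).mean()
print(round(share_over_2000 * 100, 1))  # percent of orders above 2,000
```

On the real data this share would land well below 1%, which is what makes such checks "unusual".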
# calculating the total sales for each category
# (excluding discount, because we don't know for which item this was)
total_sales = (orders.query('desc != "discount"')
.groupby(['category'], as_index=False)['total_price']
.sum()
.sort_values('total_price', ascending=False))
total_sales['total_price'] = round(total_sales['total_price'] / 1000000, 2)
total_sales
| | category | total_price |
|---|---|---|
| 6 | Living | 3.84 |
| 5 | Kitchen | 1.68 |
| 2 | Fashion & Accessoires | 1.13 |
| 1 | DIY & Stationary | 0.98 |
| 0 | Candles & Lamps | 0.79 |
| 3 | Holidays | 0.61 |
| 9 | Travel & Garden | 0.27 |
| 4 | Household | 0.24 |
| 8 | Toys | 0.22 |
| 7 | Other | 0.06 |
# plotting the sales by category as pie chart
fig = px.pie(total_sales,
values='total_price',
names='category',
title='Revenue Share by Category (in million)',
color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_traces(textinfo='value + percent')
fig.update_layout(font_family="Bahnschrift")
fig.show()
We already found that most of the items we sold were from the category Living, and we also had the highest revenue from this category: 39% of the sales, a total of 3.84 million. Surprisingly, "DIY & Stationary" items were the second most often sold, but they are only in fourth place when it comes to revenue. Just 9.95% of our revenue came from this category (0.98 million).
The categories with the second and third highest sales are Kitchen and Fashion & Accessoires, while Household items, Travel & Garden and Toys are the categories with the least revenue (less than 3% each).
# calculating monthly sales by category (without discount)
monthly_sales = (orders.query('desc != "discount"')
.groupby(['category', 'month'], as_index=False)['total_price']
.sum()
.sort_values('total_price', ascending=False))
# plotting the monthly sales by category as stacked area chart
fig = px.area(monthly_sales,
x="month",
y="total_price",
color="category",
color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_layout(title='Total monthly sales by category',
font_family="Bahnschrift")
fig.show()
The category 'Living' was the strongest one in every month, and we saw a steady increase in revenue from May 2019 to November 2019.
In general sales went up from May 2019 and reached an all-time high in November 2019. Since we only have data for one year, we can't tell if this is a seasonal pattern (the months leading up to Christmas see higher revenue) or if our sales went up in general, maybe because we added interesting items to our portfolio or ran marketing campaigns that attracted new customers.
The share that each category has in our sales stays more or less the same all year, except for Living, DIY & Stationary and Holidays, which all increased from August to November 2019.
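To check the claim that the category mix stays roughly constant, we can normalise each month's revenue to shares. A minimal sketch, using an invented two-month stand-in for the `monthly_sales` frame built above:

```python
import pandas as pd

# tiny stand-in for the `monthly_sales` frame (values invented)
monthly_sales = pd.DataFrame({
    'month':       ['2019-01', '2019-01', '2019-02', '2019-02'],
    'category':    ['Living', 'Kitchen', 'Living', 'Kitchen'],
    'total_price': [400.0, 100.0, 390.0, 110.0],
})

# normalising each month to 100% turns absolute revenue into shares,
# which makes "the category mix barely changes" directly checkable
monthly_sales['share'] = (monthly_sales['total_price']
                          / monthly_sales.groupby('month')['total_price'].transform('sum'))
shares = monthly_sales.pivot(index='month', columns='category', values='share')
print(shares.round(2))
```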
# calculating the total prices of orders
monthly_check = orders.groupby(['month','order_id'], as_index=False)['total_price'].sum()
# calculating the mean and median price of orders by month
monthly_avg = monthly_check.groupby('month', as_index=False).agg({'total_price':['mean', 'median']})
monthly_avg.columns = ['month', 'mean', 'median']
# plotting the mean and median check by month as linechart
fig = px.line(monthly_avg,
x="month",
y=["mean", "median"],
title='Mean and median order size',
color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_traces(line=dict(width=4))
fig.update_layout(font_family="Bahnschrift")
fig.show()
The median order size is more or less consistent throughout the year. We had a small decrease in June 2019 and an increase in September 2019, but overall the median stays between 270 and 300.
The mean order size varies more: it was lowest in April 2019 at about 400 and highest in September 2019 at more than 550. We obviously had some very small orders in April and maybe a few very large ones in September.
# calculating the sum for each customer and order
check_cus = (orders.query('customer_id != "Unknown"')
.groupby(['customer_id', 'order_id'], as_index=False)['total_price']
.sum())
# calculating the mean check for each customer
mean_check_cus = check_cus.groupby('customer_id', as_index=False)['total_price'].mean()
# plotting the distribution of average order size as histogram
fig = px.histogram(mean_check_cus, x="total_price",
title='Histogram of average customer check',
labels={'total_price':'Avg check'}, # can specify one label per df column
opacity=0.8,
log_y=True, # represent bars with log scale
color_discrete_sequence=px.colors.qualitative.Pastel # color of histogram bars
)
fig.update_layout(font_family="Bahnschrift")
fig.show()
Most customers spend less than 2.000 per order on average. Only a few average checks go up to 12.000 (even though we removed items with a very high quantity or unit price).
Not many average checks exceed 500, and from there to 2.000 the count decreases steadily.
# calculating the number of unique orders per month
orders_per_month = orders.groupby('month')['order_id'].nunique().reset_index()
# displaying them as lineplot
fig = px.line(orders_per_month,
x="month",
y="order_id",
title='number of orders per month',
labels = {'order_id':'number of orders'},
color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_traces(line=dict(width=4))
fig.update_layout(font_family="Bahnschrift")
fig.show()
The number of orders slowly increases, but with a lot of fluctuation. From December 2018 to April 2019 it stays between 1.000 and 1.500 orders per month. In May 2019 we saw a small peak with almost 1.750 orders, and in October and November the number went up very quickly, to 2.750 orders. This was probably also related to the Christmas season.
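One way to separate the slow trend from the autumn jump is a month-over-month growth rate. A sketch with invented numbers standing in for `orders_per_month`:

```python
import pandas as pd

# stand-in for `orders_per_month` (numbers invented to mirror the chart)
orders_per_month = pd.DataFrame({
    'month':    ['2019-08', '2019-09', '2019-10', '2019-11'],
    'order_id': [1400, 1600, 2200, 2750],
})

# month-over-month growth separates the slow upward trend from the
# sharp October/November jump visible in the line chart
orders_per_month['mom_growth'] = orders_per_month['order_id'].pct_change()
print(orders_per_month)
```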
# find unit price, number of sold items and number of orders containing this item
items_per_order = (orders.query('desc != "discount"')
.groupby(['category','desc_clean'], as_index=False)
.agg({'unit_price':'mean', 'qty':'sum', 'order_id':'nunique'})
.query('qty > 0'))
# finding the average unit price for each category
avg_unit_price = orders.groupby('category')['unit_price'].mean()
avg_unit_price.name = 'avg_cat_unit_price'
# adding the average unit price to the items
items_per_order = items_per_order.merge(avg_unit_price, how='left', on='category')
# finding the 10 most sold items
most_sold = items_per_order.nlargest(10, 'qty')
# finding items that were sold less than 10 times and grouping them by category
least_sold = (items_per_order.query('qty < 10')
.groupby('category', as_index = False)
.agg({'unit_price':'mean','qty':'sum', 'avg_cat_unit_price':'mean'})
.sort_values('qty', ascending=False))
# calculating the difference between unit price and avg unit price in the same category
most_sold['difference'] = most_sold['unit_price'] - most_sold['avg_cat_unit_price']
least_sold['difference'] = least_sold['unit_price'] - least_sold['avg_cat_unit_price']
# displaying the tables
display(most_sold)
display(least_sold)
| | category | desc_clean | unit_price | qty | order_id | avg_cat_unit_price | difference |
|---|---|---|---|---|---|---|---|
| 1125 | Fashion & Accessoires | jumbo bag red retrospot | 2.488009 | 47603 | 2085 | 2.411720 | 0.076289 |
| 2258 | Kitchen | small popcorn holder | 1.047841 | 45090 | 1384 | 3.649244 | -2.601403 |
| 3642 | Toys | world war 2 gliders asstd designs | 0.321856 | 41847 | 528 | 2.761172 | -2.439316 |
| 2109 | Kitchen | pack of 72 retrospot cake cases | 0.765445 | 34884 | 1316 | 3.649244 | -2.883799 |
| 317 | Candles & Lamps | white hanging heart t-light holder | 3.220903 | 33497 | 2261 | 2.766235 | 0.454668 |
| 2372 | Living | assorted colour bird ornament | 1.723255 | 33461 | 1452 | 4.199706 | -2.476451 |
| 591 | DIY & Stationary | mini paint set vintage | 0.786101 | 26512 | 377 | 1.998472 | -1.212371 |
| 1759 | Household | pack of 12 london tissues | 0.450195 | 26299 | 512 | 2.647627 | -2.197432 |
| 200 | Candles & Lamps | rabbit night light | 2.395662 | 24771 | 988 | 2.766235 | -0.370573 |
| 311 | Candles & Lamps | victorian glass hanging t-light | 1.646399 | 24171 | 1033 | 2.766235 | -1.119836 |
| | category | unit_price | qty | avg_cat_unit_price | difference |
|---|---|---|---|---|---|
| 2 | Fashion & Accessoires | 5.815321 | 566 | 2.411720 | 3.403602 |
| 6 | Living | 8.335182 | 255 | 4.199706 | 4.135476 |
| 1 | DIY & Stationary | 3.574458 | 198 | 1.998472 | 1.575986 |
| 0 | Candles & Lamps | 5.821883 | 115 | 2.766235 | 3.055648 |
| 3 | Holidays | 2.610000 | 73 | 2.182205 | 0.427795 |
| 5 | Kitchen | 4.683704 | 54 | 3.649244 | 1.034460 |
| 4 | Household | 3.476429 | 31 | 2.647627 | 0.828802 |
| 9 | Travel & Garden | 5.802857 | 25 | 4.758733 | 1.044124 |
| 8 | Toys | 4.483333 | 13 | 2.761172 | 1.722161 |
| 7 | Other | 10.038333 | 9 | 3.886401 | 6.151932 |
The most sold item is the "jumbo bag red retrospot" with 47.603 units sold across 2.085 orders. It is a bit more expensive than other products from the category Fashion & Accessoires but still very popular.
Other popular items are usually cheaper than the average unit price of their category. For example, the "assorted colour bird ornament" costs only 1.72, while products in the category Living have an average unit price of 4.19.
The less popular products are usually a bit more expensive than the average unit price of their category.
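The observation that cheaper-than-average items sell more can be checked with a rank correlation between the `difference` column and the quantity sold. A toy sketch, with numbers loosely taken from the tables above:

```python
import pandas as pd

# toy stand-in for `items_per_order` after adding the `difference` column
items = pd.DataFrame({
    'qty':        [47603, 41847, 26512, 566, 255, 115],
    'difference': [0.08, -2.44, -1.21, 3.40, 4.14, 3.06],
})

# Spearman (rank) correlation is robust to the enormous spread in
# quantities; a negative value supports the reading that items priced
# below their category average tend to sell more
rho = items['qty'].corr(items['difference'], method='spearman')
print(round(rho, 2))
```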
# plotting products by the quantity and number of orders they were in
fig = px.scatter(items_per_order,
x="qty",
y="order_id",
size="qty",
color="category",
hover_name="desc_clean",
log_x=True,
size_max=60,
title='Most sold Items',
labels={'qty':'Number of items sold', 'order_id':'Number of orders containing item'},
color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_layout(font_family="Bahnschrift")
fig.show()
In this visualisation we can see that, besides the jumbo bag red retrospot, the white hanging heart t-light holder had the best combination of quantity sold and number of different orders. Some products in the top right corner weren't in our list of the top 10 most sold items, because there we only ranked by the number of items sold. Here we can see that some items were ordered in bulk by the same customers, while others were popular with more people.
In most cases, a product moves through three stages – Introduction, Maturity and Decline. We will find the first and last month each product was ordered, the month with the highest revenue from them and try to calculate how long a product usually is interesting for our customers or if some items don't sell at all anymore.
# finding the first and last date an item was sold
lifespan = (orders.query('qty > 0 & desc != "discount"')
.groupby(['category', 'desc_clean'], as_index=False)
.agg({'date':['first', 'last'], 'order_id':'count'}))
lifespan.columns = ['category','desc_clean', 'first_sale', 'last_sale', 'n_orders']
# calculating the number of days a product was present in our shop
lifespan['lifespan'] = ((lifespan['last_sale'] - lifespan['first_sale'])/ np.timedelta64(1, 'D')).astype(int)
# showing the lifespan of products for each category as boxplots
fig = go.Figure()
for category in lifespan['category'].unique():
    fig.add_trace(go.Box(
        y=lifespan.query('category == @category')['lifespan'],
        name=category))
fig.update_layout(title='Lifespan (in days) of Products per Category',
showlegend=False,
font_family="Bahnschrift")
fig.show()
# products that were only sold on one day
len(lifespan.query('lifespan == 0'))
128
Products from the categories Toys, Household and Other usually have the longest lifespans, and very few products there have short ones. These categories contain timeless items that are interesting for our customers throughout the whole year.
The Holiday category on the other hand contains many products with short lifespans; they only sell during a few months of the year. Since we only have data for one year, these products could of course become interesting again next Christmas or Easter, but they don't sell well in every month.
There are 128 products that were sold on only one day, so they have a lifespan of 0.
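Besides the lifespan itself, the time since the last sale tells us whether a product may have been discontinued. A sketch with invented product rows standing in for the `lifespan` frame; the 90-day cutoff is an arbitrary illustrative choice:

```python
import pandas as pd

# stand-in for the `lifespan` frame: one row per product with its last sale
# (product names and dates are illustrative)
lifespan = pd.DataFrame({
    'desc_clean': ['rabbit night light', 'easter egg garland', 'felt xmas decoration'],
    'last_sale':  pd.to_datetime(['2019-11-28', '2019-04-18', '2018-12-20']),
})

# products whose last sale lies long before the end of the dataset are
# candidates for delisting; 90 days is an arbitrary cutoff for this sketch
dataset_end = pd.Timestamp('2019-11-30')
lifespan['days_inactive'] = (dataset_end - lifespan['last_sale']).dt.days
stale = lifespan.query('days_inactive > 90')
print(stale[['desc_clean', 'days_inactive']])
```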
We want to divide our products into main and additional assortment. To do this we will find items that were sold without any other products, meaning that the customer placed an order specifically to buy this item. All other items will be considered additional.
# finding products that were sold by themselves
items_in_order = orders.groupby('order_id', as_index = False).agg({'desc_clean':['nunique','first']})
items_in_order.columns = ['order_id', 'unique_items', 'desc_clean']
items_in_order = items_in_order.query('unique_items == 1')
len(items_in_order)
1462
There are 1.462 orders that contained only a single product.
# calculating how often these products were sold by themselves
single_orders = items_in_order.groupby('desc_clean', as_index=False)['unique_items'].count()
single_orders.columns = ['desc_clean', 'n_orders']
single_orders.sort_values('n_orders', ascending=False).head()
| | desc_clean | n_orders |
|---|---|---|
| 109 | chilli lights | 30 |
| 445 | rabbit night light | 29 |
| 477 | regency cakestand 3 tier | 23 |
| 590 | small popcorn holder | 19 |
| 672 | white hanging heart t-light holder | 19 |
# finding the total number of orders that these products were in
single_total_orders = (orders[orders['desc_clean'].isin(single_orders['desc_clean'])]
.groupby(['category', 'desc_clean'], as_index=False)['order_id']
.count()
.sort_values('order_id', ascending=False))
# defining the products as "main" assortement and all others as "additional"
main_assortement = single_orders['desc_clean']
orders.loc[orders['desc_clean'].isin(main_assortement), 'assortement'] = 'main'
orders['assortement'] = orders['assortement'].fillna('additional')
# calculating number of items in the assortements
assmt_share = orders.groupby('assortement')['desc_clean'].nunique().reset_index()
# displaying the share as pie chart
fig = px.pie(assmt_share,
values='desc_clean',
names='assortement',
title='Unique products in main and additional assortment',
color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_layout(title={'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
legend=dict(yanchor="top",
y=0.8,
xanchor="left",
x=0.75),
font_family="Bahnschrift")
fig.update_traces(textinfo='value + percent')
fig.show()
Only 18.9% of our products are considered "main" assortment. The vast majority belongs to the additional assortment, meaning those products were never sold by themselves.
# finding the total number of sold items from each category and assortement
asstm_split = (orders.query('desc != "discount"')
.groupby(['category', 'assortement'], as_index=False)
.agg({'qty':'sum'}))
# plotting them as bar chart, grouped by category
fig = px.bar(asstm_split,
x="category",
y="qty",
color='assortement',
barmode='group',
text="qty",
labels={'qty':'number of items sold'},
title='Sold items by category and assortment (main and additional)',
color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_layout(font_family="Bahnschrift")
fig.show()
For the categories Kitchen, Other and Travel & Garden there is a balance between the quantities sold from the main and the additional assortment. For Living, DIY & Stationary, Holidays, Household and Toys more of the sold items come from the additional assortment. Only for Candles & Lamps and Fashion & Accessoires is the quantity sold from the main assortment higher than from the additional assortment.
# calculating the share of main and additional products for each category
cat_assmt = orders.query('desc != "discount"').groupby(['category', 'assortement']).agg({'desc_clean':'nunique'})
cat_assmt_pcts = cat_assmt.groupby(level=0).apply(lambda x: round(100 * x / float(x.sum())))
cat_assmt_pcts.reset_index(inplace=True)
cat_assmt_pcts.sort_values(['assortement', 'desc_clean'], inplace=True)
# plotting the data as stacked barcharts
fig = px.bar(cat_assmt_pcts,
x="category",
y="desc_clean",
color='assortement',
text="desc_clean",
labels={'desc_clean':'Share of items in category'},
title='Assortment split by category (in %)',
color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_layout(font_family="Bahnschrift")
fig.show()
The categories Kitchen, Candles & Lamps and Toys have the highest share of products in the main assortment (more than 20% each). By our definition this means that customers place orders just to buy a certain product from one of these categories.
Household and Fashion & Accessoires products more often belong to the additional assortment, with a main share of 15% or less. Customers buy something else and add these products to their cart, but they are not the main reason for the order.
# calculating the total revenue from both assortements in millions
assmt_rev = orders.groupby('assortement', as_index=False)['total_price'].sum()
assmt_rev['rev_in_millions'] = round(assmt_rev['total_price'] / 1000000, 2)
# displaying the share as pie chart
fig = px.pie(assmt_rev,
values='rev_in_millions',
names='assortement',
title='Revenue share by assortment (in million)',
color='assortement',
color_discrete_map={'main':'#F6CF71',
'additional':'#66C5CC'})
fig.update_layout(title={'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
legend=dict(yanchor="top",
y=0.8,
xanchor="left",
x=0.75),
font_family="Bahnschrift")
fig.update_traces(textinfo='value + percent')
fig.show()
Our assignment of "main" and "additional" seems to make sense. Even though there are a lot more products in the additional assortment, more than half of our revenue comes from the main assortment (56.7%, or 5.56 million).
In this step we want to find products that were often sold together. We will not use returned items or discounts for this calculation and limit the bundle size to 2 products.
# only using orders with a positive price (no discounts or returns)
positive_orders = orders.query('total_price > 0').copy()
# adding all the productnames from the same order in a new column
positive_orders['Product Bundle'] = positive_orders.groupby('order_id')['desc_clean'].transform(lambda x: ','.join(x))
# dropping duplicates so that we only have the order id and bundle for each order once
positive_orders = positive_orders[['order_id', 'Product Bundle']].drop_duplicates()
# counting how often a combination of 2 products is present in the product bundles
count = Counter()
for row in positive_orders['Product Bundle']:
    # sorting and de-duplicating makes the pair counts order-insensitive
    # and avoids self-pairs when a product appears twice in the same order
    row_list = sorted(set(row.split(',')))
    count.update(Counter(combinations(row_list, 2)))
# selecting the 15 most common bundles
bundles = pd.DataFrame(count.most_common(15))
bundles.columns = ['bundle', 'count']
# joining each (item_a, item_b) tuple into a readable text label
bundles['bundle'] = bundles['bundle'].apply(', '.join)
# displaying the top 15 product bundles and their frequency
fig = go.Figure(go.Bar(x=bundles['count'],
y=bundles['bundle'],
orientation='h',
text=bundles['count'],
textposition='auto'))
fig.update_layout(yaxis={'categoryorder':'total ascending'},
title={'text':'Frequency of top 15 product bundles',
'y':0.9,
'x':0.45,
'xanchor': 'center',
'yanchor': 'top'},
font_family="Bahnschrift")
fig.update_traces(marker_color='rgb(102,197,204)')
fig.show()
The most popular bundle is "jumbo bag pink polkadot" combined with "jumbo bag red retrospot"; they were bought together 825 times. In general, a lot of those bags appear in our top 15 bundles.
We already established that a lot of our customers probably are retailers, so it makes sense that they might buy different varieties of the same product, like different designs and colors.
We will run a statistical test to find out if our hypothesis is true:
Candles are sold more often in the Winter, because they make your home cozy and comfortable
# finding orders that contained candles
candle_orders = orders.query('category == "Candles & Lamps"')['order_id']
# grouping orders and customer_ids together
candles = orders.groupby(['order_id', 'date', 'month'])['customer_id'].nunique().reset_index()
# adding the information if a candle was part of the order
candles.loc[candles['order_id'].isin(candle_orders), 'contains_candles'] = 1
candles['contains_candles'] = candles['contains_candles'].fillna(0)
# calculating the share of orders that contained candles for each day
candles = candles.groupby(['month', 'date'], as_index=False)['contains_candles'].mean()
# defining the period "winter"
winter = ['2018-12-01', '2019-01-01', '2019-02-01']
# adding information on the season (winter or other)
candles.loc[candles['month'].isin(winter), 'season'] = "winter"
candles['season'] = candles['season'].fillna('Other')
# plotting a histogram split by seasons
fig = px.histogram(candles, x="contains_candles",
color = 'season',
opacity=0.5,
color_discrete_sequence=px.colors.qualitative.Pastel,
labels={'contains_candles':"share of orders that contained candles (by day)"}
)
fig.update_layout(font_family="Bahnschrift",
title="Histograms of orders that contained candles")
fig.show()
# setting the statistical significance level
alpha = 0.05
# defining the samples that we will test
sample_array_1 = candles[candles['month'].isin(winter)]['contains_candles']
sample_array_2 = candles[~candles['month'].isin(winter)]['contains_candles']
# printing the means for both samples
print('Mean share of orders with candles in winter: {:.1%}'.format(sample_array_1.mean()))
print('Mean share of orders with candles in other seasons: {:.1%}'.format(sample_array_2.mean()))
print()
# Levene's test for equal variances (informative only: the Mann-Whitney U
# test is non-parametric and has no equal-variance parameter)
stat, p = st.levene(sample_array_1.to_numpy(),
                    sample_array_2.to_numpy(),
                    center='median',
                    proportiontocut=0.05)
print('Levene p-value:', round(p, 3))
if p > 0.05:
    print('Equal Variance = True')
else:
    print('Equal Variance = False')
# testing for a statistically significant difference between the samples
p_value = st.mannwhitneyu(sample_array_1, sample_array_2, alternative='two-sided')[1]
print()
print('P-value:', round(p_value,5))
if p_value < alpha:
    print("We reject the null hypothesis")
    print('There is a statistically significant difference between the two groups.')
    print('The share of orders that contain candles is different in winter months.')
else:
    print("We can't reject the null hypothesis")
    print('There is no statistically significant difference between the two groups.')
    print('The share of orders that contain candles is the same throughout the year.')
Mean share of orders with candles in winter: 56.0% Mean share of orders with candles in other seasons: 51.7% Levene p-value: 0.358 Equal Variance = True P-value: 0.00056 We reject the null hypothesis There is a statistically significant difference between the two groups. The share of orders that contain candles is different in winter months.
Just as we suspected, there is a statistically significant difference in the share of orders containing candles between winter and the other seasons. In winter 56% of all orders contained candles or lamps, compared with 51.7% in the rest of the year. The difference doesn't look big, but the test tells us it is statistically significant.
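As a side note, the Levene result would actually drive a decision if we compared the means with a t-test, where SciPy exposes an `equal_var` switch (Mann-Whitney U does not take one). A sketch on synthetic shares, not our real samples:

```python
import numpy as np
from scipy import stats as st

# synthetic daily candle shares mimicking the two samples above
rng = np.random.default_rng(0)
winter = np.clip(rng.normal(0.56, 0.06, size=90), 0, 1)
other = np.clip(rng.normal(0.52, 0.06, size=270), 0, 1)

# Levene's test decides between the pooled t-test and Welch's t-test
_, p_levene = st.levene(winter, other, center='median')
t_stat, p_value = st.ttest_ind(winter, other, equal_var=(p_levene > 0.05))
print(f'Levene p={p_levene:.3f}, t-test p={p_value:.4f}')
```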
Key insights and recommendations
Link to the dashboard:
https://public.tableau.com/app/profile/laura.str.mpler/viz/p100eCommerce/SalesDashboard
Link to the presentation:
https://drive.google.com/file/d/1HjEduSYEL1HNWY9QkuBID2RqzwIqmmHg/view?usp=sharing